
Solution: Itertools


Manuel Escalona

Here is my solution:

from dataclasses import dataclass
from faker import Faker
import random
import itertools
import operator

@dataclass
class Person:
    name: str
    age: int
    city: str
    country: str

# Instantiate the Faker module
fake = Faker()

# List of possible countries
countries = [
    "UK",
    "USA",
    "Japan",
    "Australia",
    "France",
    "Germany",
    "Italy",
    "Spain",
    "Canada",
    "Mexico",
]

# Generate 1000 random Person instances
PERSON_DATA: list[Person] = [
    Person(fake.name(), random.randint(18, 70), fake.city(), random.choice(countries))
    for _ in range(1000)
]

def is_older_than(age: int, threshold: int = 21):
    return age >= threshold

def main() -> None:
    filtered_data: list[Person] = []
    for person in PERSON_DATA:
        if person.age >= 21:
            filtered_data.append(person)

    # filter persons by age
    filtered_data2 = list(itertools.filterfalse(lambda x: not is_older_than(x.age), PERSON_DATA))

    filtered_data2.sort(key=operator.attrgetter("country"))

    an_iterator = itertools.groupby(filtered_data2, key=lambda p: p.country)

    summary = {country: len(list(group)) for country, group in an_iterator}
    print(f"Summary: {summary}")

if __name__ == "__main__":
    main()

REPLY
Andreas [ArjanCodes Team]

Nice solution Manuel!

There are some remarks I would like to make:
* Uppercase names are conventionally reserved for constants that are not calculated; keep the name lowercase if the value is calculated
* The list call is only needed because of the in-place sort that follows: filterfalse returns an iterator, not a list, and sorted accepts any iterable, so you could pass the filterfalse result to sorted directly
* The type annotation on PERSON_DATA is not needed, since the type is inferred from the list itself. It is not wrong, though, so keep it if you prefer. In that case I would argue that the other variables and constants should be annotated too
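
A minimal sketch of the naming and list points, assuming a trimmed-down Person and a few hypothetical sample records (not the Faker data from the solution). It checks what filterfalse actually returns and shows sorted consuming that result directly:

```python
import itertools
import operator
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    country: str


# Hypothetical sample data; lowercase name because in the real
# solution this list is computed rather than a fixed constant.
person_data = [
    Person("Ann", 34, "UK"),
    Person("Bob", 19, "USA"),
    Person("Cleo", 52, "UK"),
]

adults = itertools.filterfalse(lambda p: p.age < 21, person_data)

# filterfalse returns a lazy iterator, not a list, so it has no .sort() method.
result_is_list = isinstance(adults, list)

# sorted() accepts any iterable and returns a new list,
# so no separate list() call is required before sorting.
adults_by_country = sorted(adults, key=operator.attrgetter("country"))
```

The list call in the posted solution does the same job, because list.sort needs an actual list to sort in place.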

REPLY
Manuel Escalona

Thanks for your suggestions, Andreas. I will keep them in mind.

REPLY
patrick ruejoma

I used the Counter class from the collections library:

import collections as c
import itertools as it

def main()....
    filtered_data = list(it.filterfalse(lambda person: person.age < 20, PERSON_DATA))
    summary: dict[str, int] = {}
    summary = dict(c.Counter(person.country for person in filtered_data))

REPLY
Andreas [ArjanCodes Team]

Nice solution! Some minor improvements can be made. First, import Counter directly instead of importing the whole collections module. Second, there is no need to create the summary variable with a type annotation before the Counter call. The annotation can then be dropped as well, since the type can be inferred.
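
One way those two suggestions might look when applied, as a sketch only. The trimmed-down Person and the sample records are hypothetical stand-ins for the course data, and the age threshold of 20 is taken from the snippet above:

```python
from collections import Counter
from dataclasses import dataclass
from itertools import filterfalse


@dataclass
class Person:
    name: str
    age: int
    country: str


# Hypothetical sample data standing in for the generated list.
person_data = [
    Person("Ann", 34, "UK"),
    Person("Bob", 19, "USA"),
    Person("Cleo", 52, "UK"),
    Person("Dan", 20, "Japan"),
]

# Drop everyone younger than 20; the result is a lazy iterator.
filtered_data = filterfalse(lambda person: person.age < 20, person_data)

# Counter is imported directly and assigned once, without a prior annotation.
summary = Counter(person.country for person in filtered_data)
print(summary)
```

Counter is a dict subclass, so wrapping it in dict is optional unless a plain dict is specifically required.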

REPLY
Roberto

Why does groupby not work without sorting the data first?

REPLY
Arjan Egges

Hi Roberto, groupby groups *consecutive items* from an iterable that have the same key value. This is why the iterable should be sorted by the key before using `groupby`, otherwise items with the same key that are not consecutive won't be grouped together.

Let's say you have the following list of numbers and you want to group them by their value:

numbers = [1, 2, 2, 1, 3, 2]

If you were to use `groupby` directly on this list, you would get:

1: [1]
2: [2, 2]
1: [1]
3: [3]
2: [2]

This is because groupby simply groups consecutive items with the same key. The number 1 appears in two separate groups and the number 2 appears in two separate groups as well.

If you first sort the list (so the list becomes [1, 1, 2, 2, 2, 3]), then using groupby will result in this:

1: [1, 1]
2: [2, 2, 2]
3: [3]

In the challenge code, sorting ensures that all persons from the same country are adjacent to each other, so groupby can group them into a single group. Hope that clarifies it!
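
The numbers example above can be reproduced with a short script. This is a small sketch using only the standard library; the variable names unsorted_groups and sorted_groups are chosen here for illustration:

```python
from itertools import groupby

numbers = [1, 2, 2, 1, 3, 2]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_groups = [(key, list(group)) for key, group in groupby(numbers)]

# After sorting, equal values are adjacent, so each key forms one group.
sorted_groups = [(key, list(group)) for key, group in groupby(sorted(numbers))]

print("Unsorted input:")
for key, group in unsorted_groups:
    print(f"{key}: {group}")

print("Sorted input:")
for key, group in sorted_groups:
    print(f"{key}: {group}")
```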

REPLY
Roberto

Thanks a lot! I thought it worked like pandas groupby.

REPLY
Arjan Egges

You're welcome!

REPLY
Marc Nakhleh

Great problem!
Looking at a couple of different approaches (filterfalse vs generator expressions, groupby vs defaultdict vs Counter) was a nice refresher on the cost of sorting lists:

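timeit(filterfalse_groupby, number=1000)        2.442
timeit(filterfalse_defaultdict, number=1000)    1.301
timeit(filterfalse_counter, number=1000)        1.280
timeit(generator_groupby, number=1000)          2.255
timeit(generator_defaultdict, number=1000)      1.146
timeit(generator_counter, number=1000)          1.145

Boosting the Persons list from range(10_000) to range(100_000):

timeit(filterfalse_groupby, number=1000)       50.180
timeit(filterfalse_defaultdict, number=1000)   21.197
timeit(filterfalse_counter, number=1000)       21.496
timeit(generator_groupby, number=1000)         47.090
timeit(generator_defaultdict, number=1000)     15.741
timeit(generator_counter, number=1000)         15.362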
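
The benchmarked functions were not posted, so the following is only a guess at what three of them might look like. It reuses the function names from the timeit calls, but the bodies, the (age, country) tuples and the list size are assumptions made for illustration. Timings will differ from the numbers above:

```python
import random
from collections import Counter, defaultdict
from itertools import filterfalse, groupby
from timeit import timeit

COUNTRIES = ["UK", "USA", "Japan"]

random.seed(0)
# Hypothetical stand-in for the Person list: (age, country) tuples.
people = [(random.randint(18, 70), random.choice(COUNTRIES)) for _ in range(1_000)]


def filterfalse_groupby() -> dict[str, int]:
    adults = filterfalse(lambda p: p[0] < 21, people)
    # groupby needs adjacent keys, so this variant pays for a sort.
    ordered = sorted(adults, key=lambda p: p[1])
    return {
        country: len(list(group))
        for country, group in groupby(ordered, key=lambda p: p[1])
    }


def filterfalse_defaultdict() -> dict[str, int]:
    counts: defaultdict[str, int] = defaultdict(int)
    # Single pass, no sorting required.
    for _, country in filterfalse(lambda p: p[0] < 21, people):
        counts[country] += 1
    return dict(counts)


def filterfalse_counter() -> dict[str, int]:
    adults = filterfalse(lambda p: p[0] < 21, people)
    # Counter also tallies in a single pass.
    return dict(Counter(country for _, country in adults))


for func in (filterfalse_groupby, filterfalse_defaultdict, filterfalse_counter):
    print(f"{func.__name__}: {timeit(func, number=100):.3f} s")
```

All three variants produce the same counts. Only the groupby version has to sort first, which is the extra cost the comment refers to.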

REPLY
Arjan Egges

Thanks for posting those numbers - definitely a big cost difference between the approaches!

REPLY
Michael Brittain

You could also combine itertools with collections.Counter to do it in two lines (plus imports):

from collections import Counter
import itertools

filtered_data = itertools.filterfalse(lambda person: person.age < 21, PERSON_DATA)
summary = Counter(person.country for person in filtered_data)

REPLY